Utilization of Michael Jordan’s game statistics for estimating average point scoring

Author

Kamil Chmielak

Published

March 16, 2024

The project objective

The main objective of the analysis is to examine the impact of other statistics on Michael Jordan’s point scoring across 9 NBA regular seasons and to create a predictive model explaining the quantity of points scored per game.

Dataset

To create the dataset and conduct the analysis, statistics from each of the 9 NBA seasons were utilized, which were then merged into a single main data frame. The statistics were sourced from www.basketball-reference.com, a website that houses statistics, results, and histories of the NBA, ABA, WNBA leagues, as well as top European competitions.

Table 1: The first 10 rows of the statistics table from the Regular Season
Date AH Opp WL MP PTS FG P2A FGA P3 P3A FT FTA ORB DRB TRB AST STL BLK TOV PF
1984-10-26 H WSB W 40 16 5 16 16 0 0 6 7 1 5 6 7 2 4 5 2
1984-10-27 A MIL L 34 21 8 13 13 0 0 5 5 3 2 5 5 2 1 3 4
1984-10-29 H MIL W 34 37 13 24 24 0 0 11 13 2 2 4 5 6 2 3 4
1984-10-30 A KCK W 36 25 8 21 21 0 0 9 9 2 2 4 5 3 1 6 5
1984-11-01 A DEN L 33 17 7 15 15 0 0 3 4 3 2 5 5 1 1 2 4
1984-11-07 A DET W 27 25 9 19 19 0 0 7 9 1 3 4 3 3 1 5 5
1984-11-08 A NYK W 33 33 15 22 22 0 0 3 4 4 4 8 5 3 2 5 2
1984-11-10 A IND W 42 27 9 22 22 0 0 9 12 2 7 9 4 2 5 3 4
1984-11-13 H SAS W 43 45 18 26 27 1 1 8 11 2 8 10 4 3 2 4 4
1984-11-15 H BOS L 33 27 12 23 24 0 1 3 3 0 2 2 2 2 1 1 4

The table presents statistics from the regular-season games Jordan played between his rookie season (1984-85) and the 1992-93 season, at the end of which the Chicago Bulls won the NBA championship for the third consecutive time (Three-peat).

Description of the headers:

  • Date - The date of the game in the format year, month, and day (YYYY-MM-DD)
  • AH - Information about where the game was played - “home” or “away”
    • A - Away
    • H - Home
  • Opp - The opponent the Chicago Bulls played in that game
  • WL - Game result
    • W - Win
    • L - Loss
  • MP - Minutes played
  • PTS - The number of scored points
  • FG - Field goals made
  • P2A - Two-point attempts
  • FGA - Field goal attempts
  • P3 - Three-pointers made
  • P3A - Three-point attempts
  • FT - Free throws made
  • FTA - Free throw attempts
  • ORB - Offensive rebounds
  • DRB - Defensive rebounds
  • TRB - Total rebounds
  • AST - Assists
  • STL - Steals
  • BLK - Blocks
  • TOV - Turnovers
  • PF - Personal fouls

The presented data frame has the following structure:

Data type in data frame “RS”
Column Data type
Date Date
AH character
Opp character
WL character
MP numeric
PTS numeric
FG numeric
P2A numeric
FGA numeric
P3 numeric
P3A numeric
FT numeric
FTA numeric
ORB numeric
DRB numeric
TRB numeric
AST numeric
STL numeric
BLK numeric
TOV numeric
PF numeric

The dataset consists of 667 observations and contains 21 columns.

The dataset contains 3 columns of character type, 1 column of date type, and 17 columns of numeric type. The data does not contain any missing (NA) values.

Basic statistics

Characteristic N = 667¹
PTS 32 (8, 69, 9)
FG 12.1 (3.0, 27.0, 3.7)
P2A 22.0 (7.0, 44.0, 5.8)
FGA 23 (7, 49, 6)
P3 0.43 (0.00, 7.00, 0.83)
P3A 1.43 (0.00, 12.00, 1.58)
FT 7.6 (0.0, 26.0, 4.0)
FTA 9.0 (0.0, 27.0, 4.5)
ORB 1.70 (0.00, 8.00, 1.48)
DRB 4.63 (0.00, 13.00, 2.59)
TRB 6.3 (0.0, 18.0, 3.1)
AST 5.90 (1.00, 17.00, 2.79)
STL 2.72 (0.00, 10.00, 1.71)
BLK 1.03 (0.00, 6.00, 1.11)
TOV 3.01 (0.00, 8.00, 1.75)
PF 2.91 (0.00, 6.00, 1.37)
¹ Mean (Minimum, Maximum, SD)

The summary table above did not, on its own, suggest any initial insights that could help in model construction.

Charts

Table 2: Summary of scored points
Minimum Mean Median Sum Maximum
8 32.3 32 21541 69

The variable PTS (points scored) is a key factor in the model-building process and serves as the dependent variable of the constructed model. The range of values for this variable spans from a minimum of 8 to a maximum of 69, indicating the diversity of achieved results. The median, which establishes the central point of the distribution, is 32, with a close mean of 32.3.

The skewness of the points variable, which measures the asymmetry of the distribution, is 0.346. A positive skewness value indicates that the right tail of the distribution is longer than the left, so unusually high-scoring games are more common than unusually low-scoring ones. This right-skewness may result from exceptional games that inflate the mean - for example, the 10 games in which Jordan scored 54 or more points.

The excess kurtosis of the points variable, which measures the heaviness of the tails (often described as peakedness) relative to a normal distribution, is 0.386. The moderately positive value suggests that the distribution is slightly more peaked and heavier-tailed than a normal distribution.
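The two shape measures can be illustrated with a short sketch. The report's own computations appear to have been done in R; the Python code below is only an illustration that assumes the simple moment-based definitions (the function used by the author may apply a small-sample correction, so its values could differ slightly). The `pts` list holds only the ten PTS values from Table 1, not the full 667-game dataset.

```python
from statistics import fmean


def central_moment(xs, k):
    """k-th central moment of a sample (divides by n, not n - 1)."""
    m = fmean(xs)
    return fmean([(x - m) ** k for x in xs])


def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2). Positive means a longer right tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


# PTS values from the first 10 rows of Table 1 (illustration only).
pts = [16, 21, 37, 25, 17, 25, 33, 27, 45, 27]
print(round(skewness(pts), 3), round(excess_kurtosis(pts), 3))
```

With only ten games the printed values are not comparable to the 0.346 and 0.386 reported for the full dataset; the sketch only shows how the measures are defined.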

The plot of points scored against minutes played shows a clear positive correlation between the two variables. In other words, the longer the player participates in the game, the more points he tends to score. This suggests that playing time is one of the key factors influencing Jordan's scoring output.

In the plot depicting the relationship between points scored and assists, the trend line has a characteristic shape resembling a flattened lowercase ‘m’ and stays within a narrow band of y values, from 31 to 33 points scored.

Because the trend line remains between 31 and 33 points for most assist counts, the number of points scored appears to be largely independent of the number of assists. This suggests that, for this specific analysis, the number of assists is not a key factor influencing scoring.

From the analysis of the plot, it can be observed that the 0.5 and 0.75 quantiles for away games are lower than for home games. This means that in half of the cases and in the upper quartile, the player scores fewer points when playing away. However, the 0.25 quantiles remain at the same level, suggesting that the lower quartile of scoring does not significantly differ between the two locations.

Table 3: Quantiles of points scored ‘at home’ versus ‘away’
Type Q0.25 Median Q0.75
Away 26 32 37
Home 26 33 38

The calculated quantiles confirm this: the median and the 0.75 quantile are each one point higher for home games than for away games, while the 0.25 quantile is the same (26 points) in both cases, indicating slightly higher scoring in home matches.
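The grouping behind Table 3 can be sketched as follows. This is an illustrative Python example, not the author's code; the helper name `quartiles_by_location` is hypothetical, and the input is only the ten games from Table 1, so the numbers differ from Table 3. The `method="inclusive"` option interpolates linearly between order statistics, which should correspond to R's default quantile definition (type 7).

```python
from statistics import quantiles

# (AH, PTS) pairs from the first 10 rows of Table 1 (illustration only,
# not the full 667-game dataset behind Table 3).
games = [("H", 16), ("A", 21), ("H", 37), ("A", 25), ("A", 17),
         ("A", 25), ("A", 33), ("A", 27), ("H", 45), ("H", 27)]


def quartiles_by_location(rows):
    """Return {location: (Q0.25, median, Q0.75)} of PTS grouped by AH."""
    groups = {}
    for location, pts in rows:
        groups.setdefault(location, []).append(pts)
    # n=4 yields the three cut points that split the data into quarters.
    return {loc: tuple(quantiles(vals, n=4, method="inclusive"))
            for loc, vals in groups.items()}


for loc, (q1, med, q3) in sorted(quartiles_by_location(games).items()):
    print(loc, q1, med, q3)
```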

The three plots depicting the relationship between points scored and shot attempts in each category (free throws, mid-range shots, and three-point shots) reveal clear, positive correlations between the number of attempted shots and points scored. All three plots point to a strong relationship, suggesting that shot volume in each category significantly impacts Jordan's point total.

The conclusion drawn is that all three analyzed shot categories likely have a strong impact on the final point outcome, indicating that they will be highly statistically significant variables during the construction of a predictive model.

When the number of shots differs greatly between categories, this can affect the model, especially because the shot types differ in how many points they are worth.

In this case, mid-range shots are attempted far more often than shots in the other categories, so the model may estimate their influence more precisely. However, the mere fact that one category is more numerous does not automatically mean that it will have a greater impact in the model.

On the presented plot depicting the relationship between points scored in a game and the number of offensive rebounds, interesting trends can be observed. As the number of offensive rebounds increases, we observe a slight increase in points, suggesting a positive correlation between these two variables.

When the number of rebounds is 1, a slight increase is observed, confirming the impact of even a single offensive rebound on the point outcome. Then, with 2 rebounds, a bump to around 33 points is observed, followed by a slight decrease to 32.

However, from 4 rebounds onwards, a clear increase in the trend line can be seen, although it is worth noting that the confidence interval of the regression line for this area significantly widens. This suggests that as the number of offensive rebounds increases, this variable becomes a less certain predictor of the point outcome.

As the number of turnovers increases, we observe a fluctuating character of the plot, generally hovering around 32.5 (32-33) points. However, we notice a slight decrease in points scored after reaching 5 turnovers, and from 7 turnovers onwards, we observe a sharp increase in the confidence interval of the trend line. This suggests that the number of turnovers may have a limited impact on points scored, and due to the widening confidence interval for higher values, it can be inferred that the variable of turnovers may be statistically insignificant in the predictive model.

In the plot of points scored against steals, the trend line initially rises at almost a 45-degree angle, indicating a positive relationship, but the relationship is nonlinear overall. An interesting aspect is the flattening of the line at around 3 steals. This may suggest that the first few steals are associated with an increase in points scored, but beyond a certain level additional steals have less impact on the point outcome - for values from 4 to 7 steals, the number of points scored is almost constant. Beyond this range, additional steals are again associated with more points, and the curve becomes convex. This suggests that the number of steals may be a significant explanatory factor in the model, but its impact is nonlinear.

The correlation table of variables

From the preliminary analysis of the correlation matrix, we can observe that 6 variables are significantly correlated with points scored. The strongest positive correlations are those of P2A - “Mid-range shots attempted” (0.71) and FTA - “Free throw attempts” (0.57), indicating a strong relationship between points scored and both two-point attempts and free throw attempts. The moderate positive correlation (0.44) of MP - “Minutes played” also suggests that more time spent on the court tends to go with a higher number of points scored.

The remaining 3 variables - Steals, Offensive rebounds, Three-point attempts - have low positive correlations (0.15-0.20) with the dependent variable, indicating that their influence will not be as significant in the constructed model.

For three other variables, the correlation with the number of points scored is not statistically significant.
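As an illustration of how such correlations and their significance can be assessed, the following Python sketch computes a Pearson coefficient and the usual t statistic for testing whether a correlation differs from zero. It is not the author's code; the function names are hypothetical, and the sample uses only the PTS and P2A values from the first 10 rows of Table 1.

```python
from math import sqrt
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1.0 - r * r)


# PTS and P2A from the first 10 rows of Table 1 (illustration only).
pts = [16, 21, 37, 25, 17, 25, 33, 27, 45, 27]
p2a = [16, 13, 24, 21, 15, 19, 22, 22, 26, 23]
print(round(pearson_r(pts, p2a), 2))

# With n = 667 games, even a low correlation such as 0.15 gives a t statistic
# well above the two-sided 5% critical value of roughly 1.96.
print(round(t_statistic(0.15, 667), 2))
```

This is why a correlation can be statistically significant and still weak: with 667 observations, significance says little about the strength of the relationship.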

Model construction

Using stepwise and backward regression, we obtained one model that explains the variable PTS using the variables AH, MP, FTA, P2A, P3A, and ORB. Drawing on our own knowledge that Michael Jordan was known for a high number of steals per game, we also created a model with the variable STL. After conducting further analyses, which are not included in the report, we decided not to add the variable STL to the model because it was statistically insignificant, and to remove the variable ORB because its interpretation was unclear. Below is a comparison of these 3 models using model quality measures and selection criteria.


Call:
lm(formula = PTS ~ AH + MP + FTA + P2A + P3A, data = RS)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5509  -3.1272   0.0035   3.2000  19.7824 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  3.20300    1.40985   2.272             0.023415 *  
AHH          1.33433    0.38218   3.491             0.000513 ***
MP          -0.09970    0.04338  -2.298             0.021847 *  
FTA          0.88403    0.04507  19.615 < 0.0000000000000002 ***
P2A          1.01380    0.03958  25.611 < 0.0000000000000002 ***
P3A          1.37890    0.12239  11.266 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.9 on 661 degrees of freedom
Multiple R-squared:  0.7172,    Adjusted R-squared:  0.7151 
F-statistic: 335.3 on 5 and 661 DF,  p-value: < 0.00000000000000022

We can see that all estimates of the structural parameters are statistically significant at the 0.05 level. Additionally, the coefficient of determination is 0.7172, which means that approximately 72% of the variability in PTS is explained by the independent variables. The residual standard error of the model is 4.9, which means that the model's predictions are typically off by about 4.9 points per game.

PTS = 3.20 + 1.33*AHH - 0.10*MP + 0.88*FTA + 1.01*P2A + 1.38*P3A
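The summary figures in the output above are tied together by standard formulas, which can be checked with a short sketch. The Python code below is illustrative only (the function names are hypothetical); it takes the reported R-squared, the number of observations (667) and the number of predictors (5) and recomputes the adjusted R-squared and the F-statistic.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (plus intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F-statistic on p and n - p - 1 degrees of freedom."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# Values taken from the model summary: 667 games, 5 predictors, R^2 = 0.7172.
n, p, r2 = 667, 5, 0.7172
print("residual df:", n - p - 1)
print("adjusted R^2:", round(adjusted_r2(r2, n, p), 4))
print("F-statistic:", round(f_statistic(r2, n, p), 1))
```

Because the input R-squared is itself rounded to four decimals, the recomputed values agree with the reported 0.7151 and 335.3 only up to rounding.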

Model diagnostics

Normal Q-Q plot

In the considered model, observations do not deviate from the straight line, indicating that we can assume the residuals are normally distributed.

Residuals vs Fitted plot

On the Residuals vs Fitted plot for the considered model, the trend line is approximately straight and horizontal, indicating that the linear relationship has been captured by the model and that no systematic pattern has been left in the residuals.

Scale-Location plot

On the above plot, it can be seen that the red curve is close to the horizontal line, and the square roots of standardized residuals are evenly distributed around the red line. Therefore, the assumption of homoscedasticity of residuals may be satisfied. It is recommended to verify this observation using an appropriate statistical test to further confirm the hypothesis of homoscedasticity of the model.


    RESET test

data:  mdl
RESET = 0.31075, df1 = 2, df2 = 659, p-value = 0.733

The obtained p-value in the RESET test, which is 0.733, suggests no evidence of nonlinearity.

Linear independence

      AH       MP      FTA      P2A      P3A 
1.014452 1.540117 1.157166 1.451581 1.031757 

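All variance inflation factors (VIF) fall within the 1-5 range (the largest is about 1.54), which indicates at most moderate collinearity - no significant issues.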
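To make these values easier to read, recall that the variance inflation factor of a predictor is defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The Python sketch below (illustrative, with a hypothetical helper name) inverts this definition for the values printed above.

```python
# VIF values copied from the output above.
vif = {"AH": 1.014452, "MP": 1.540117, "FTA": 1.157166,
       "P2A": 1.451581, "P3A": 1.031757}


def r2_from_vif(v):
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary R^2."""
    if v < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / v


for name, v in vif.items():
    # R^2 is the share of the predictor's variance explained by the others.
    print(f"{name}: VIF = {v:.3f}, auxiliary R^2 = {r2_from_vif(v):.3f}")
```

Even for MP, the predictor with the largest VIF, only about a third of its variance is explained by the remaining predictors.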

Homoscedasticity


    Goldfeld-Quandt test

data:  mdl
GQ = 0.76987, df1 = 328, df2 = 327, p-value = 0.9909
alternative hypothesis: variance increases from segment 1 to 2

In the Goldfeld-Quandt test conducted on the model under consideration, the p-value is 0.9909, so there is no evidence that the residual variance increases from the first segment to the second; the constant-variance assumption of linear regression appears to be met.

Autocorrelation of errors


    Durbin-Watson test

data:  mdl
DW = 2.017, p-value = 0.5751
alternative hypothesis: true autocorrelation is greater than 0

    Breusch-Godfrey test for serial correlation of order up to 2

data:  mdl
LM test = 5.4617, df = 2, p-value = 0.06516

The Durbin-Watson test and the Breusch-Godfrey test for serial correlation up to order 2 have p-values of 0.5751 and 0.0652 respectively. Both p-values are greater than 0.05, suggesting that there is not enough evidence to conclude that there is autocorrelation in the model residuals, although the Breusch-Godfrey result is close to the threshold.
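For reference, the Durbin-Watson statistic itself is simple to compute from the residuals. The Python sketch below is an illustration with hypothetical function names and toy residuals; it does not reproduce the p-value, which requires the test's sampling distribution.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den


def implied_lag1_autocorrelation(dw):
    """Approximate lag-1 autocorrelation from DW, using DW ~ 2 * (1 - rho)."""
    return 1.0 - dw / 2.0


# Toy residuals: a perfectly alternating series has strong negative
# autocorrelation, so DW is well above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))

# The reported DW = 2.017 corresponds to a lag-1 autocorrelation near zero.
print(round(implied_lag1_autocorrelation(2.017), 4))
```

Values of DW close to 2 indicate little first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.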

Normality of errors


    Shapiro-Wilk normality test

data:  mdl$residuals
W = 0.9981, p-value = 0.6766

The Shapiro-Wilk test conducted does not reject the hypothesis of normality of the model residuals (p-value > 0.05).

Model predictions

On March 31, 1989, Michael Jordan played a home game against the Cleveland Cavaliers. He scored 37 points in 43 minutes of play, attempting 7 free throws, 27 mid-range shots, and 2 three-pointers. Despite MJ’s excellent statistics, the Chicago Bulls ended up losing the game.

AH MP FTA P2A P3A
H 43 7 27 2
fit lwr upr
36.56912 26.92057 46.21767

The second test game for the predictive model took place on November 12, 1989, when Michael Jordan played an away game against the New Jersey Nets (now the Brooklyn Nets), which the Chicago Bulls lost.

Jordan played for 43 minutes, during which he scored 42 points. To achieve this result, he attempted 12 free throws, 28 mid-range shots, and 3 three-pointers.

AH MP FTA P2A P3A
A 43 12 28 3
fit lwr upr
42.04767 32.39495 51.70038

On April 12, 1991, Michael Jordan played an away game against the Detroit Pistons. He scored 40 points in 43 minutes of play, attempting 15 free throws, 22 mid-range shots, and 2 three-pointers. Despite MJ’s excellent statistics, the Chicago Bulls ended up losing the game.

AH MP FTA P2A P3A
A 43 15 22 2
fit lwr upr
37.23804 27.58526 46.89082
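The fitted values above can be reproduced by hand from the estimated coefficients. The following Python sketch is illustrative only (the report's predictions were presumably produced in R, and `predict_pts` is a hypothetical helper); it plugs the inputs of the three test games into the regression equation, using the rounded coefficients from the model summary.

```python
# Rounded coefficients copied from the lm() summary of the final model.
COEF = {"(Intercept)": 3.20300, "AHH": 1.33433, "MP": -0.09970,
        "FTA": 0.88403, "P2A": 1.01380, "P3A": 1.37890}


def predict_pts(ah, mp, fta, p2a, p3a):
    """Point prediction from the fitted equation; AHH is 1 for home games."""
    home = 1.0 if ah == "H" else 0.0
    return (COEF["(Intercept)"] + COEF["AHH"] * home + COEF["MP"] * mp
            + COEF["FTA"] * fta + COEF["P2A"] * p2a + COEF["P3A"] * p3a)


# (AH, MP, FTA, P2A, P3A, reported fit, actual PTS) for the three test games.
test_games = [("H", 43, 7, 27, 2, 36.56912, 37),
              ("A", 43, 12, 28, 3, 42.04767, 42),
              ("A", 43, 15, 22, 2, 37.23804, 40)]

for ah, mp, fta, p2a, p3a, reported, actual in test_games:
    fit = predict_pts(ah, mp, fta, p2a, p3a)
    print(f"fit = {fit:.2f}, reported = {reported:.2f}, actual = {actual}")
```

The recomputed values match the reported fit column up to the rounding of the coefficients. In all three games the actual score (37, 42 and 40 points) lies inside the reported lwr-upr interval.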